[1.x] Backport Fix for duplicate subgraph inputs/outputs (#16131) #19112
Conversation
* fix for duplicate inputs
* fixed error
* fixed whitespace
* Remove duplicate outputs from subgraphs
* changed subgraph to create map of outputs
* added static_cast
* changed map<int,v> to vector
* sanity fix
* sanity2
* updated backends with new connectSubgraphOutputs API
* fixed map creation logic
* added updates for reattach function
* creating node only if it is not an input to subgraph
* creating object based on var_name only
* updating ConnectSubgraphOutputs for mkldnn_elemwisemul_post_quantize_property.h
* add debug prints to debug error in CI
* remove prints
* added prints to debug in the CI
* revert changes
* reverted changes
* deduplicated inputs to subgraph
* deduplicated subgraph inputs
* simplified inputs
* cleaned up
* deduplicate outputs
* cleaned up
* added deduplication to subgraph node outputs
* fixed prev compare
* fixed issue with inputs and added test
* fixed whitespace, removed prints

Co-authored-by: Sam Skalicky <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Manu Seth <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
Hey @samskalicky , Thanks for submitting the PR
CI supported jobs: [miscellaneous, windows-cpu, unix-cpu, edge, unix-gpu, website, windows-gpu, centos-gpu, centos-cpu, sanity, clang]
@ZhennanQin @pengzhao-intel I'm debugging a test failure with this PR:
https://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/unix-cpu/branches/PR-19112/runs/1/nodes/296/steps/781/log/?start=0 But now I'm getting a segfault:
I debugged further and found the segfault was coming from here:
There seem to be a lot of hard-coded magic numbers in the subgraph/mkldnn code. @HahTK I don't think we're going to be able to get this PR merged in time for the v1.8 code freeze without some help. Worst case, we'll just get this in later on master in 2.0.
@HahTK I've added support for specifying
std::vector<nnvm::NodeEntry> *unique_orig_entries,
std::vector<nnvm::NodeEntry*> *unique_input_entries,
const bool skip_var = false,
const bool dedup = false) {
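For illustration, the input-deduplication idea behind these parameters (the names `unique_orig_entries` and `unique_input_entries` come from the C++ signature above; this Python rendering is a hypothetical sketch, not MXNet code) amounts to keeping the first occurrence of each entry and remembering a remapping from old input slots to deduplicated ones:

```python
# Hypothetical sketch of subgraph-input deduplication. Entries are
# modeled as (node_name, output_index) tuples instead of nnvm::NodeEntry.

def dedup_inputs(entries, dedup=True):
    """Return the unique input entries plus a map from each original
    input slot to its slot in the deduplicated list."""
    if not dedup:
        # dedup=False preserves the old behavior: inputs pass through as-is.
        return list(entries), list(range(len(entries)))
    unique = []    # analogous to unique_orig_entries
    index_of = {}  # entry -> slot in `unique`
    remap = []     # original input slot -> deduplicated slot
    for e in entries:
        if e not in index_of:
            index_of[e] = len(unique)
            unique.append(e)
        remap.append(index_of[e])
    return unique, remap

# A subgraph that consumes the same tensor twice (e.g. elemwise_mul(a, a)):
inputs = [("a", 0), ("a", 0), ("b", 0)]
unique, remap = dedup_inputs(inputs)
```

Here `unique` ends up as `[("a", 0), ("b", 0)]` and `remap` as `[0, 0, 1]`, so both former consumers of the duplicate are wired to the single surviving input.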
This comment applies everywhere where we set the defaults for the dedup flag.
It seems like this is more a problem with MKLDNN.
I understand they have done that for master branch.
Ideally, they fix it at their end for 1.8 as well.
Failing that, we should go with default = true, except when compiled for MKLDNN and optimize_for() is called by CPU. All other paths would want to optimize away the dupe input.
This amounts to MKLDNN patching their fix.
Given the timelines for MXNet 1.8, this may be too tight anyway.
More importantly, the MKLDNN issues are fixed in master.
I am ok going ahead with this as is
LGTM
@samskalicky I've been working on enabling MKLDNN partitioning on master and discovered that this PR broke the mobilenet_struct_v2 test. I hadn't seen this PR before, so I'm sorry for the late response. I've described where the problem is in the picture above.
Hi @bgawrych, sorry, it looks like I tagged the wrong people; I'll start tagging you for MKLDNN-related issues in the future.
The problem that I ran into in v1.x was that the mkldnn partitioning flow was expecting the input to be duplicated, and was playing around with wiring up the subgraph nodes. I wasn't able to decode the current flow and make it work with deduplicated inputs. Here are two of the places where the code is expecting inputs to be duplicated: Ideally, I would want to refactor the code so that there are no baked-in expectations, and however the subgraph is partitioned, it will work with the subgraph op (i.e.
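A baked-in expectation of the kind described above can be illustrated with a hypothetical sketch (not actual MKLDNN code): a fused op that hard-codes input positions happens to work while inputs are duplicated, then fails once they are deduplicated.

```python
# Hypothetical illustration of why a flow with baked-in input positions
# breaks under deduplication. Inputs are plain floats for simplicity.

def fused_mul_sum(inputs):
    # Assumes the subgraph always receives exactly three inputs, with the
    # duplicated tensor at positions 0 and 1 -- a hard-coded expectation.
    return inputs[0] * inputs[1] + inputs[2]

# With the input duplicated (a, a, b), the hard-coded indexing works:
result = fused_mul_sum([2.0, 2.0, 5.0])  # 2.0 * 2.0 + 5.0 = 9.0

# After deduplication the same subgraph receives only (a, b), and the
# indexing reads past the end of the input list:
try:
    fused_mul_sum([2.0, 5.0])
except IndexError:
    pass  # the baked-in position for the third input no longer exists
```

This is why the PR keeps deduplication off by default: flows that encode positions like this keep working unchanged until they are refactored.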
In the meantime, I'll work on porting the
* initial commit
* update build_subgraph
* added test
* calling test

Co-authored-by: Ubuntu <[email protected]>
Description
This PR started out as a backport of #16131 to v1.x, but then ran into some issues with MKLDNN subgraphing. I wasn't able to enhance the MKLDNN subgraphing to work with deduplicated inputs/outputs since there were too many hard-coded magic numbers to decode. Instead, I added a flag `dedup_subgraph` to enable deduplicating subgraph inputs/outputs for our needs, but it is disabled by default to keep the current MKLDNN/TensorRT and any other partitioning flows working. So now we set the `"dedup_subgraph"` attribute on the `Graph` and check for it in build_subgraph.cc. We set this attribute on the `Graph` in `MXOptimizeForBackend`, and it is set by users as an argument to `optimize_for`:
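The output side of the fix follows the same pattern as the inputs. Below is a hypothetical Python sketch (not MXNet code; output names and the `dedup_outputs` helper are illustrative) of collapsing duplicate subgraph outputs and rewiring external consumers to the surviving output index:

```python
# Hypothetical sketch of subgraph-output deduplication. Outputs are
# modeled as plain strings instead of nnvm::NodeEntry.

def dedup_outputs(outputs):
    """Return the unique outputs plus a map from each old output index
    to its new index, used to rewire external consumers."""
    unique = []
    remap = []   # old output index -> new output index
    seen = {}
    for out in outputs:
        if out not in seen:
            seen[out] = len(unique)
            unique.append(out)
        remap.append(seen[out])
    return unique, remap

# A subgraph exposing the same tensor twice:
outputs = ["conv0_out", "conv0_out", "relu1_out"]
unique, remap = dedup_outputs(outputs)

# A consumer that previously read old output 1 now reads new output 0:
consumer_slot = remap[1]
```

In the actual flow, per the description above, users opt in from the Python frontend via the `optimize_for` call (something like `sym.optimize_for(backend, ..., dedup_subgraph=True)`; the exact argument spelling here is an assumption based on the flag name in this PR), which sets the `"dedup_subgraph"` attribute that build_subgraph.cc checks.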